Conversation

@ggerganov
Member

This is an example of what I think is "chunked prefill". The idea is to avoid blocking text-generating slots when a new, large prompt arrives for processing in parallel.
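To make the idea a bit more concrete, here is a minimal, self-contained sketch of the kind of scheduling this implies. It is not the code in this PR; the `Slot` struct, the `n_batch` value and the token counts are all made up for illustration. It only shows how each batch can mix one generated token per active slot with a chunk of a pending prompt, so a big prompt is spread over many batches instead of occupying one huge batch:

// Illustrative sketch of chunked prefill scheduling (not the implementation in this PR).
// Each batch reserves one token per slot that is currently generating, and the remaining
// budget is spent on chunks of pending prompts, so a huge prompt never monopolizes a batch.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Slot {
    int id;
    std::vector<int> prompt;  // prompt tokens still waiting to be prefilled
    int n_decoded;            // tokens generated so far
    int n_predict;            // tokens this slot wants to generate in total
    bool generating() const { return prompt.empty() && n_decoded < n_predict; }
};

int main() {
    const int n_batch = 512;  // hypothetical per-batch token budget

    // slot 0 ~ the small chat request, slot 1 ~ the ~8k-token prompt from the example below
    std::vector<Slot> slots = {
        { 0, std::vector<int>(  32, 0), 0, 64 },
        { 1, std::vector<int>(8192, 0), 0, 64 },
    };

    for (int i_batch = 0; ; ++i_batch) {
        int budget   = n_batch;
        int n_gen    = 0;
        int n_prompt = 0;

        // 1. every generating slot contributes exactly one token to this batch
        for (auto & slot : slots) {
            if (slot.generating() && budget > 0) {
                slot.n_decoded += 1;
                budget -= 1;
                n_gen  += 1;
            }
        }

        // 2. the leftover budget is filled with prompt chunks ("chunked prefill")
        for (auto & slot : slots) {
            const int n_take = std::min(budget, (int) slot.prompt.size());
            slot.prompt.erase(slot.prompt.begin(), slot.prompt.begin() + n_take);
            budget   -= n_take;
            n_prompt += n_take;
        }

        if (n_gen == 0 && n_prompt == 0) {
            break;  // all prompts consumed and all generations finished
        }

        printf("batch %3d: %3d generated token(s), %3d prompt token(s)\n", i_batch, n_gen, n_prompt);
    }

    return 0;
}

The commands below reproduce the scenario against the actual server: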

# start server with 3 parallel slots
./bin/llama-server -m ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf -ngl 99 -fa --port 8033 -c 0 -np 3

# generation task with small prompt
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n '{ messages: [{ role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Write quick sort in c++." }], "stream": true }')"

# task with a large prompt
curl --request POST --url http://127.0.0.1:8033/completion --header "Content-Type: application/json" --data '{"prompt": "'"$(printf 'hello %.0s' $(seq 1 8149))"'. I believe the meaning of life is","n_predict": 64, "cache_prompt": true}' | jq

With this PR, the first task is no longer "blocked" while the long prompt of the second task is being processed.

Still, I'm not sure how valuable this feature is. Chunking the prompts like this makes the overall prompt processing slower, so even though the server feels more responsive, the total wait time across all requests is longer.
